Diary 2024-01-14
The naive vector similarity method separates English and Japanese.
ベクトル空間で英語と日本語は分離してる.icon
For me personally, this is a big problem, so I'm not very satisfied with a simple embedded vector similarity
But very few people use both English and Japanese, so you don't have to worry about it.
But when I went to bed and woke up, I felt like, "Why don't we machine translate every English sentence into Japanese and load it into a vector index?
Better than making it into a vector and then thinking "how do I paste together the spaces that are far apart"?
I feel like I can't provide user value with that.
If it is Japanese who use it, the query is also usually in Japanese
But when I search for idioms, I get Chinese hits.
If so, it would be better to load the "Japanese" side of "other language -> Japanese" into the search index.
Then, turn the arrow in the opposite direction and read the original text after the Japanese hits if you want to.
The association of two chunks by machine translation is a kind of "link"
Hits by vector search are loose links
Compare with explicit human links in Scrapbox
Link meaning "same content, different language."
Scrapbox link to "Human Associations" link
What to record
---
This page is auto-translated from /nishio/日記2024-01-14 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thought to non-Japanese readers.